IPython (Interactive Python) is an enhanced Python shell which provides a more robust and productive development environment for users. There are several key features that set it apart from the standard Python shell.
In IPython, all your inputs and outputs are saved. Two variables, In and Out, are filled in as you work. Each output is also saved automatically to a variable of the form _N, where N is the prompt number, and each input to _iN. This lets you quickly recover the result of a prior computation by referring to its number, even if you forgot to store it in a variable.
In [ ]:
import numpy as np
np.sin(4)**2
In [ ]:
_1
In [ ]:
_i1
In [ ]:
_1 / 4.
In [ ]:
some_dict = {}
some_dict?
If available, additional detail is provided with two question marks, including the source code of the object itself.
In [ ]:
from numpy.linalg import cholesky
cholesky??
This syntax can also be used to search namespaces with wildcards (*).
In [ ]:
import numpy as np
np.random.rand*?
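Outside IPython, where the ? and ?? syntax is unavailable, roughly the same information can be retrieved with the standard library. The sketch below is an approximation, not how IPython implements introspection; textwrap.dedent and the math module are stand-in targets chosen only because they ship with Python.

```python
import fnmatch
import inspect
import math
import textwrap

# Stand-in target: a pure-Python standard-library function, so its source is available.
target = textwrap.dedent

signature = inspect.signature(target)   # the call signature, as reported by `?`
docstring = inspect.getdoc(target)      # the docstring, as reported by `?`
source = inspect.getsource(target)      # the source code, as reported by `??`

print(f"dedent{signature}")
print(docstring.splitlines()[0])
print(source.splitlines()[0])

# Wildcard search of a namespace, similar in spirit to `math.is*?`
matches = [name for name in dir(math) if fnmatch.fnmatchcase(name, "is*")]
print(matches)
```

inspect.getsource only works for objects defined in Python source files; for compiled functions it raises an error, which is why ?? shows source only "if available".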
In [ ]:
np.ar
This can even be used to help with specifying arguments to functions, which can sometimes be difficult to remember:
In [ ]:
plt.hist
In [ ]:
ls /Users/fonnescj/repositories/scientific-python-workshop/
Virtually any system command can be accessed by prepending !, which passes any subsequent command directly to the OS.
In [ ]:
!locate python | grep pdf
You can even use Python variables in commands sent to the OS:
In [ ]:
file_type = 'csv'
!ls ../data/*$file_type
The output of a system command using the exclamation point syntax can be assigned to a Python variable.
In [ ]:
data_files = !ls ../data/microbiome/
In [ ]:
data_files
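In a plain Python script, where the ! syntax is not available, a comparable result can be obtained with the subprocess module. This is a minimal sketch: run_lines is a hypothetical helper name, and the child command simply prints two made-up file names so that the example runs on any platform.

```python
import subprocess
import sys

def run_lines(command):
    """Run a command (given as a list of arguments) and return its stdout as a list of lines."""
    completed = subprocess.run(command, capture_output=True, text=True, check=True)
    return completed.stdout.splitlines()

# Stand-in for `!ls ../data/`: a child Python process that prints two file names.
listing = run_lines([sys.executable, "-c", "print('a.csv'); print('b.csv')"])
print(listing)
```

As with the ! form, the result is a sequence of output lines that can be indexed or looped over like any other Python list.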
If you type at the system prompt:
$ jupyter qtconsole
instead of opening in a terminal, IPython will start a graphical console that at first sight appears just like a terminal but is in fact much more capable than a text-only one. It is a specialized terminal designed for interactive scientific work: it supports full multi-line editing with color highlighting and graphical calltips for functions, it can keep multiple IPython sessions open simultaneously in tabs, and it can display figures inline in the work area when scripts run.
Over time, the IPython project grew to include several components, including:
As each component evolved, several grew to the point that they warranted projects of their own. For example, pieces like the notebook and protocol are not even specific to Python. As a result, the IPython team created Project Jupyter, which is the new home of language-agnostic projects that began as part of IPython, such as the notebook in which you are reading this text.
The HTML notebook that is part of the Jupyter project supports interactive data visualization and easy high-performance parallel computing.
In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
def f(x):
    return (x-3)*(x-5)*(x-7)+85
import numpy as np
x = np.linspace(0, 10, 200)
y = f(x)
plt.plot(x,y)
The notebook lets you document your workflow using either HTML or Markdown.
The Jupyter Notebook consists of two related components:
The Notebook can be used by starting the Notebook server with the command:
$ jupyter notebook
This initiates an IPython engine, which is a Python instance that takes Python commands over a network connection.
The IPython controller provides an interface for working with a set of engines, to which one or more IPython clients can connect.
The Notebook gives you everything that a browser gives you. For example, you can embed images, videos, or entire websites.
In [ ]:
from IPython.display import IFrame
IFrame('https://jupyter.org', width='100%', height=350)
In [ ]:
from IPython.display import YouTubeVideo
YouTubeVideo("rl5DaFbLc60")
Markdown is a simple markup language that allows plain text to be converted into HTML.
The advantages of using Markdown over HTML (and LaTeX):
For example, instead of writing:
<p>In order to create valid
<a href="http://en.wikipedia.org/wiki/HTML">HTML</a>, you
need properly coded syntax that can be cumbersome for
“non-programmers” to write. Sometimes, you
just want to easily make certain words <strong>bold
</strong>, and certain words <em>italicized</em> without
having to remember the syntax. Additionally, for example,
creating lists:</p>
<ul>
<li>should be easy</li>
<li>should not involve programming</li>
</ul>
we can write the following in Markdown:
In order to create valid [HTML], you need properly
coded syntax that can be cumbersome for
"non-programmers" to write. Sometimes, you just want
to easily make certain words **bold**, and certain
words *italicized* without having to remember the
syntax. Additionally, for example, creating lists:
* should be easy
* should not involve programming
Markdown uses * (asterisk) and _ (underscore) characters as indicators of emphasis.
*italic*, _italic_
**bold**, __bold__
***bold-italic***, ___bold-italic___
italic, italic
bold, bold
bold-italic, bold-italic
Markdown supports both unordered and ordered lists. Unordered lists can use *, -, or + to define a list. This is an unordered list:
* Apples
* Bananas
* Oranges
Ordered lists are numbered lists in plain text:
1. Bryan Ferry
2. Brian Eno
3. Andy Mackay
4. Paul Thompson
5. Phil Manzanera
Markdown inline links are equivalent to HTML <a href='foo.com'> links; they just have a different syntax.
[Biostatistics home page](http://biostat.mc.vanderbilt.edu "Visit Biostat!")
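To make the equivalence concrete, the sketch below converts one inline link into its HTML form. It is a toy converter for this single pattern, not a Markdown parser; LINK_PATTERN and link_to_html are hypothetical names introduced for illustration.

```python
import html
import re

# [text](url "optional title") -- handles only simple inline links.
LINK_PATTERN = re.compile(r'\[([^\]]+)\]\(([^\s)]+)(?:\s+"([^"]*)")?\)')

def link_to_html(markdown_text):
    """Replace simple Markdown inline links with equivalent HTML anchor tags."""
    def replace(match):
        text, url, title = match.groups()
        title_attr = f' title="{html.escape(title)}"' if title else ""
        return f'<a href="{html.escape(url)}"{title_attr}>{html.escape(text)}</a>'
    return LINK_PATTERN.sub(replace, markdown_text)

example = '[Biostatistics home page](http://biostat.mc.vanderbilt.edu "Visit Biostat!")'
converted = link_to_html(example)
print(converted)
```

The optional quoted string after the URL becomes the title attribute, which browsers typically show as a tooltip.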
Block quotes are denoted by a > (greater than) character before each line of the block quote.
> Sometimes a simple model will outperform a more complex model . . .
> Nevertheless, I believe that deliberately limiting the complexity
> of the model is not fruitful when the problem is evidently complex.
Sometimes a simple model will outperform a more complex model . . . Nevertheless, I believe that deliberately limiting the complexity of the model is not fruitful when the problem is evidently complex.
Images look an awful lot like Markdown links; they just have an extra ! (exclamation mark) in front of them.
![Python logo](images/python-logo-master-v3-TM.png)
In [ ]:
# %load http://matplotlib.org/mpl_examples/shapes_and_collections/scatter_demo.py
"""
Simple demo of a scatter plot.
"""
import numpy as np
import matplotlib.pyplot as plt
N = 50
x = np.random.rand(N)
y = np.random.rand(N)
colors = np.random.rand(N)
area = np.pi * (15 * np.random.rand(N))**2 # 0 to 15 point radiuses
plt.scatter(x, y, s=area, c=colors, alpha=0.5)
plt.show()
MathJax is a JavaScript implementation of LaTeX that allows equations to be embedded into HTML, either inline (such as $\alpha$) or as display equations. For example, this markup:
"""$$ \int_{a}^{b} f(x)\, dx \approx \frac{1}{2} \sum_{k=1}^{N} \left( x_{k} - x_{k-1} \right) \left( f(x_{k}) + f(x_{k-1}) \right). $$"""
becomes this:
$$ \int_{a}^{b} f(x)\, dx \approx \frac{1}{2} \sum_{k=1}^{N} \left( x_{k} - x_{k-1} \right) \left( f(x_{k}) + f(x_{k-1}) \right). $$
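The formula above is the trapezoid rule, and it translates directly into code. The sketch below assumes equally spaced points with x_0 = a and x_N = b, which the formula itself does not require; the function name trapezoid and the test integrand x**2 are illustrative choices.

```python
def trapezoid(f, a, b, n):
    """Approximate the integral of f over [a, b] using n trapezoids."""
    step = (b - a) / n
    xs = [a + k * step for k in range(n + 1)]   # x_0 = a, ..., x_n = b
    # One term per interval: (x_k - x_{k-1}) * (f(x_k) + f(x_{k-1}))
    total = sum((xs[k] - xs[k - 1]) * (f(xs[k]) + f(xs[k - 1]))
                for k in range(1, n + 1))
    return 0.5 * total

# The exact integral of x**2 over [0, 1] is 1/3.
estimate = trapezoid(lambda x: x**2, 0.0, 1.0, 1000)
print(estimate)
```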
In [ ]:
from sympy import *
init_printing()
x, y = symbols("x y")
In [ ]:
eq = ((x+y)**2 * (x+1))
eq
In [ ]:
expand(eq)
In [ ]:
(1/cos(x)).series(x, 0, 6)
In [ ]:
limit((sin(x)-x)/x**3, x, 0)
In [ ]:
diff(cos(x**2)**2 / (1+x), x)
In [ ]:
%lsmagic
Timing the execution of code; the timeit magic exists both in line and cell form:
In [ ]:
%timeit np.linalg.eigvals(np.random.rand(100,100))
In [ ]:
%%timeit a = np.random.rand(100, 100)
np.linalg.eigvals(a)
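In the cell form, the statement on the %%timeit line is run as setup and is not included in the timing. The same separation can be reproduced in a script with the standard timeit module. In this sketch, sorting a list of random numbers stands in for the eigenvalue computation so that it runs without NumPy; the workload and the repeat counts are arbitrary choices.

```python
import random
import timeit

# Setup (not timed): build the input once, like the first line of a %%timeit cell.
data = [random.random() for _ in range(10_000)]

calls_per_run = 20
# Five independent measurements, each timing 20 calls of the statement.
timings = timeit.repeat(lambda: sorted(data), number=calls_per_run, repeat=5)

best_per_call = min(timings) / calls_per_run
print(f"best of {len(timings)}: {best_per_call * 1e3:.3f} ms per call")
```

Reporting the minimum rather than the mean is a common convention, since slower runs usually reflect interference from other processes rather than the code itself.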
IPython also creates aliases for a few common interpreters, such as bash, ruby, perl, etc.
These are all equivalent to %%script <name>
In [ ]:
%%ruby
puts "Hello from Ruby #{RUBY_VERSION}"
In [ ]:
%%bash
echo "hello from $BASH"
The rpy2 package includes an rmagic extension that provides some magic functions for working with R. This extension can be loaded using the %load_ext magic as follows:
In [ ]:
%load_ext rpy2.ipython
If the above generates an error, it is likely that you do not have the rpy2 module installed. You can install this now via:
In [ ]:
!pip install rpy2
or, if you are running Anaconda, via conda:
In [ ]:
!conda install rpy2
In [ ]:
x,y = np.arange(10), np.random.normal(size=10)
%R print(lm(rnorm(10)~rnorm(10)))
In [ ]:
%%R -i x,y -o XYcoef
lm.fit <- lm(y~x)
par(mfrow=c(2,2))
print(summary(lm.fit))
plot(lm.fit)
XYcoef <- coef(lm.fit)
In [ ]:
XYcoef
In [ ]:
def div(x, y):
    return x/y
div(1,0)
In [ ]:
%debug
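The %debug magic opens an interactive post-mortem debugger at the point where the last exception was raised, so you can inspect the variables of the failing call. The sketch below shows the same idea non-interactively by walking the traceback to the innermost frame; failing_frame_locals is a hypothetical helper, not part of IPython.

```python
def div(x, y):
    return x / y

def failing_frame_locals(func, *args):
    """Call func(*args); if it raises, return the exception and the
    local variables of the frame in which it was raised."""
    try:
        func(*args)
    except Exception as exc:
        tb = exc.__traceback__
        # Follow the chain to the innermost frame, where the error occurred.
        while tb.tb_next is not None:
            tb = tb.tb_next
        return exc, dict(tb.tb_frame.f_locals)
    return None, {}

error, frame_locals = failing_frame_locals(div, 1, 0)
print(type(error).__name__, frame_locals)
```

In a script, pdb.post_mortem() offers the interactive equivalent of %debug.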
In [ ]:
!jupyter nbconvert --to html "IPython and Jupyter.ipynb"
Currently, nbconvert supports HTML (default), LaTeX, Markdown, reStructuredText, Python and HTML5 slides for presentations. Some types can be post-processed, such as LaTeX to PDF (this requires Pandoc to be installed, however).
In [ ]:
!jupyter nbconvert --to pdf "Introduction to pandas.ipynb"
A very useful online service is the IPython Notebook Viewer, which displays your notebook as a static HTML page that is easy to share with others:
In [ ]:
IFrame("http://nbviewer.ipython.org/2352771", width='100%', height=350)
GitHub also renders Jupyter Notebooks stored in its repositories.
reproducing conclusions from a single experiment based on the measurements from that experiment
The most basic form of reproducibility is a complete description of the data and associated analyses (including code!) so the results can be exactly reproduced by others.
Reproducing calculations can be onerous, even with one's own work!
Scientific data are becoming larger and more complex, making simple descriptions inadequate for reproducibility. As a result, most modern research is irreproducible without tremendous effort.
Reproducible research is not yet part of the culture of science in general, or scientific computing in particular.
There are a number of steps to scientific endeavors that involve computing:
Many of the standard tools impose barriers between one or more of these steps. This can make it difficult to iterate on or reproduce work.
The Jupyter notebook eliminates or reduces these barriers to reproducibility.
The IPython architecture consists of four components, which reside in the ipyparallel package:
Engine: The IPython engine is a Python instance that accepts Python commands over a network connection. When multiple engines are started, parallel and distributed computing becomes possible. An important property of an IPython engine is that it blocks while user code is being executed.
Hub: The hub keeps track of engine connections, schedulers, and clients, and persists all task requests and results in a database for later use.
Schedulers: All actions that can be performed on the engine go through a scheduler. While the engines themselves block when user code is run, the schedulers hide that from the user to provide a fully asynchronous interface to a set of engines.
Client: The primary object for connecting to a cluster.
This architecture is implemented using the ØMQ messaging library and the associated Python bindings in pyzmq.
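The split between blocking engines and an asynchronous front end can be illustrated by analogy with the standard library. The sketch below does not use ipyparallel at all: a thread pool plays the part of the engines and the scheduler, and slow_square is a made-up task.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_square(n):
    """A task that keeps its worker busy for a while, like user code on an engine."""
    time.sleep(0.1)
    return n * n

with ThreadPoolExecutor(max_workers=4) as pool:
    # submit() returns a future immediately, even though each worker
    # is blocked while it runs its task.
    futures = [pool.submit(slow_square, n) for n in range(4)]
    # result() waits only when the value is actually needed.
    results = [future.result() for future in futures]

print(results)
```

The caller stays free to do other work between submitting the tasks and collecting the results, which is the kind of interface the schedulers provide on top of the engines.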
To enable the IPython Clusters tab in Jupyter Notebook:
ipcluster nbextension enable
When you then start a Jupyter session, you should see the following in your IPython Clusters tab:
Before running the next cell, make sure you have first started your cluster; you can use the clusters tab in the dashboard to do so.
Select the number of IPython engines (nodes) that you want to use, then click Start.
In [ ]:
from ipyparallel import Client
client = Client()
dv = client.direct_view()
In [ ]:
len(dv)
In [ ]:
def where_am_i():
    # Import inside the function so the modules are available where it runs
    import os
    import socket
    return "In process with pid {0} on host: '{1}'".format(
        os.getpid(), socket.gethostname())
In [ ]:
where_am_i_direct_results = dv.apply(where_am_i)
where_am_i_direct_results.get()
Let's now consider a useful function that we might want to run in parallel. Here is a version of the approximate Bayesian computation (ABC) algorithm.
In [ ]:
import numpy
def abc(y, N, epsilon=(0.2, 0.8)):
    trace = []
    while len(trace) < N:
        # Simulate from priors
        mu = numpy.random.normal(0, 10)
        sigma = numpy.random.uniform(0, 20)
        # Simulate a data set of 50 observations under the sampled parameters
        x = numpy.random.normal(mu, sigma, 50)
        #if (np.linalg.norm(y - x) < epsilon):
        # Accept the draw if the simulated mean and standard deviation are close to those of y
        if ((abs(x.mean() - y.mean()) < epsilon[0]) and
                (abs(x.std() - y.std()) < epsilon[1])):
            trace.append([mu, sigma])
    return trace
In [ ]:
y = numpy.random.normal(4, 2, 50)
Let's try running this on one of the cluster engines:
In [ ]:
dv0 = client[0]
dv0.block = True
dv0.apply(abc, y, 10)
This fails with a NameError because NumPy has not been imported on the engine to which we sent the task. Each engine has its own namespace, so we need to import whatever modules we will need prior to running our code:
In [ ]:
dv0.execute("import numpy")
In [ ]:
dv0.apply(abc, y, 10)
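The effect of separate namespaces can be reproduced without a cluster. In the sketch below, a plain dictionary stands in for one engine's namespace and exec stands in for sending code to it; this is an analogy only, and none of the names come from ipyparallel.

```python
import math  # imported locally; this does not reach the separate namespace

engine_namespace = {}               # stands in for one engine's namespace
task = "result = math.sqrt(16)"     # code that relies on a module being present

try:
    exec(task, engine_namespace)
    first_attempt = "ok"
except NameError as error:
    first_attempt = f"NameError: {error}"

# Import the module inside that namespace, then run the same task again.
exec("import math", engine_namespace)
exec(task, engine_namespace)

print(first_attempt)
print(engine_namespace["result"])
```

The first attempt fails even though math is imported in the calling code, for the same reason the call to abc failed before numpy was imported on the engine.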
An easier approach is to use the parallel cell magic to import everywhere:
In [ ]:
%%px
import numpy
This magic can be used to execute the same code on all nodes.
In [ ]:
%%px
import os
print(os.getpid())
In [ ]:
%%px
%matplotlib inline
import matplotlib.pyplot as plt
import os
tsamples = numpy.random.randn(100)
plt.hist(tsamples)
_ = plt.title('PID %i' % os.getpid())
IPython Notebook Viewer: Displays static HTML versions of notebooks, and includes a gallery of notebook examples.
A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data: A landmark example of reproducible research in genomics: Git repo, IPython notebook, data and scripts.
Jacques Ravel and K Eric Wommack. 2014. All Hail Reproducibility in Microbiome Research. Microbiome, 2:8.
Benjamin Ragan-Kelley et al. 2013. Collaborative cloud-enabled tools allow rapid, reproducible biological insights. The ISME Journal, 7, 461–464. doi:10.1038/ismej.2012.123